Load required libraries

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(scater)
  library(ggplot2)
  library(scran)
  library(batchelor)
  library(BiocSingular)
  library(BiocNeighbors)
  library(Rtsne)
  library(rsvd)
  library(tidyr) 
  library(dplyr)
  library(igraph)
})

Why single-cell assays?

(Image credit: 10x Genomics: Getting started with single cell gene expression)

Generating scRNA-seq data

Unique molecular identifiers

  • Because of the small amount of input material, strong amplification is required during library preparation
  • By attaching a tag to each original transcript molecule, before the amplification, reads generated from the same molecule can be identified and collapsed at the quantification step
  • These tags are called unique molecular identifiers (UMIs), and are part of many of the single-cell sequencing protocols (in particular the droplet-based ones)

Example read structure (10x Genomics Chromium v3)

(Image credit: 10x Genomics)

  • The sample index identifies the library (in case of multiplexing)
  • The 10x barcode (or cell index) identifies the cell within the library
  • The UMI identifies the transcript molecule within a cell and gene
  • The insert is the actual cDNA sequence

You typically receive three fastq files per sample from the sequencing facility:

  • sample_I1.fastq.gz: contains sample index
  • sample_R1.fastq.gz: contains 10x barcode + UMI
  • sample_R2.fastq.gz: contains insert sequence

Data from plate-based protocols such as Smart-Seq2 are more similar to bulk data, and you typically get one fastq file per cell.

Representation of scRNA-seq data

(Original image credit: Jim Hester / https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)

Example data

Load the dataset

The dataset that we will use is a composite dataset of three independent 10x runs originating from three different labs. It consists of mammary epithelial cells, sequenced using 10x Genomics technology, which has already been pre-filtered to include only cells that are assigned unambiguously to one of three major cell types: luminal progenitors, luminal mature and basal.

Next, we load the preprocessed dataset and have a first look at the composition of the dataset:

## Download the data and set row names to gene symbols whenever possible
sce <- readRDS(gzcon(url("https://ppapasaikas.github.io/SC_datasets/Data/SCE_MammaryGland_prefiltered.rds?raw=true")))
rownames(sce) <- scater::uniquifyFeatureNames(
  ID = rownames(sce), 
  names = as.character(rowData(sce)$gene.symbols)
)
#Subsample cells to speed up processing:
set.seed(42)
n=3000
sce <- sce[, sample(1:ncol(sce), n )  ]
## Dataset compostion per cell type and study:  
table(colData(sce)$study , colData(sce)$cell.class)
##      
##       luminal_progenitor basal luminal_mature
##   spk                392   145            352
##   vis                228   408            350
##   wal                235   241            649

We can see that sce is indeed a SingleCellExperiment object.

sce
## class: SingleCellExperiment 
## dim: 5092 3000 
## metadata(0):
## assays(2): counts logcounts
## rownames(5092): Malat1 Rpl41 ... Slc45a3 Acsbg1
## rowData names(2): gene.symbols gene.counts
## colnames(3000): vis2558 spk2385 ... wal2232 wal1284
## colData names(4): study cell.class library_size detected_genes
## reducedDimNames(0):
## altExpNames(0):

Accessing the assay data, row and column annotations is done in the same way as for SummarizedExperiment objects. There is also an additional counts() accessor for the counts assay.


Check that the counts() accessor returns the same matrix as you would get if you access the counts assay using assay() or assays().

identical(assay(sce, "counts"), counts(sce))
## [1] TRUE
all(assay(sce, "counts") == counts(sce))   ## alternative
## [1] TRUE
identical(assays(sce)[["counts"]], counts(sce))
## [1] TRUE

How many genes are quantified in this experiment? What is the exact number of cells in the object?

dim(sce)
## [1] 5092 3000

While the structure of the scRNAseq data is similar to that of the bulk data, there are also important differences that affect the downstream analysis. One of these differences is that single-cell data is much more sparse; in other words, there are many more zeros in the count matrix from a single-cell experiment than from a bulk experiment. This is due to things such as:

Let’s check the fraction of zeros in our count matrix:

mean(counts(sce) == 0)
## [1] 0.7107205

We also calculate the range of library sizes, noting that these are much smaller than the typical values for bulk RNA-seq:

summary(colSums(counts(sce)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     700    2637    3458    4986    6687   26682

The number of cells in a scRNA-seq data set is typically much (several orders of magnitude) larger than the number of samples in a bulk RNA-seq experiment. Hence, the count matrices can get very large. However, since most of the values are zero, efficient storage modes, where only the non-zero values and the corresponding matrix positions are stored, can be employed. We can make sure that the count matrix in our object is indeed such a sparse matrix (in this particular data set, it is actually provided as a DelayedMatrix, which is beyond the scope of this course to discuss in detail, but which can be suitable for very large data matrices that do not fit in memory).

class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
counts(sce) <- as(counts(sce), "dgCMatrix")
class(counts(sce))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
counts(sce)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'vis2558', 'spk2385', 'vis1351' ... ]]
##                                         
## Malat1 13 123 15 562 17 564 92 246 72 35
## Rpl41   .  38  .  96  1 151 11  41 37  3
## Rps19   2  49  2  46  1  95 22  38 29  4
## Rpl32   2  31  1  68  1 106 35  39 21  6
## Rps14   .  21  1  61  2 104 23  31 26  6
## COX1    .  27  .  13  .  14 21   3 14  1
## Actb    4  18  4  10  2   7  9   6  3  4
## Rpl13a  .  30  8  67  3 116 24  41 25  9
## Rps5    3  24  5  41  4  95 12  24 22 24
## Eef1a1  2  27 11  60  6  86 12  19 23 37

Properties of scRNA-seq data and quality control

We already noted a couple of differences between scRNA-seq data and bulk data: the former typically contains many more observations, and the count matrix is much more sparse. The low amount of starting material for scRNA-seq experiments also results in a high sampling noise, and a lower correlation among cells than among bulk RNA-seq samples. This can be seen in a scatter plot of the observed counts for two randomly selected cells in our example data set:

## Scatter plot of two sparse cells with similar library sizes
idx <- order(abs(colSums(counts(sce)) - 
                   median(colSums(counts(sce)))))[1:2]
colSums(counts(sce))[idx]
## spk555 vis977 
##   3457   3458
plot(counts(sce)[, idx[1]] + 1, counts(sce)[, idx[2]] + 1, log = "xy")

Note that there are many genes that are not observed in one of the cells, but still have a high number of assigned counts in the other cell.


Exercise

How many genes are detected in at least one cell? Compare that to the number of genes detected in each of the individual cells. What does this tell you?

## Total number of detected genes
sum(rowSums(counts(sce)) > 0)
## [1] 5092
## Genes detected in single cells
summary(colSums(counts(sce) > 0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   367.0   991.8  1365.5  1473.0  1876.5  3827.0

This indicates that the gene detection is to some extent random, as expected by the sampling process. In other words, it is not always the same genes that go undetected in all cells.


While we don’t observe each gene in each cell, there is still a clear association between the overall expression level of a gene (e.g., total count) and the fraction of cells where it is detected.

## Dropout plot
plot(rowSums(counts(sce)), rowMeans(counts(sce) == 0), log = "x")

To simplify further QC, we use the Bioconductor package scater to calculate a number of summary QC statistics, both for the cells and for the genes. By providing a subset of genes (here, ribosomal protein genes), we can calculate the fraction of counts falling in these genes.

library(scater)
Rpg <- rownames(sce)[grep("^Rp[ls]", rowData(sce)$gene.symbols)]
Rpg
##  [1] "Rpl41"     "Rps19"     "Rpl32"     "Rps14"     "Rpl13a"   
##  [6] "Rps5"      "Rps4x"     "Rplp0"     "Rps9"      "Rps11"    
## [11] "Rps3a1"    "Rpl23"     "Rplp1"     "Rps3"      "Rps15a"   
## [16] "Rpl14"     "Rpl37a"    "Rplp2"     "Rpl6"      "Rps27rt"  
## [21] "Rpl36"     "Rpl8"      "Rpl17"     "Rpl7"      "Rpl4"     
## [26] "Rps15"     "Rpl18"     "Rpl35"     "Rps20"     "Rps24"    
## [31] "Rpl36a"    "Rps17"     "Rpl39"     "Rpl18a"    "Rps16"    
## [36] "Rps10"     "Rps25"     "Rps26"     "Rps18-ps6" "Rpsa"     
## [41] "Rpl9"      "Rpl13"     "Rpl22l1"   "Rpl7a"     "Rps28"    
## [46] "Rpl22"     "Rpl35a"    "Rps21"     "Rpl30"     "Rpl38"    
## [51] "Rps23"     "Rps18"     "Rpl36al"   "Rpl10"     "Rps29"    
## [56] "Rpl10a"    "Rps27"     "Rps27l"    "Rpl28"     "Rpl19"    
## [61] "Rps27a"    "Rpl27a"    "Rpl21"     "Rpl27"     "Rps2"     
## [66] "Rpl15"     "Rps6"      "Rps12"     "Rpl31"     "Rpl7l1"   
## [71] "Rps19bp1"  "Rps6kb1"   "Rps7"      "Rpl5"      "Rpl12"
sce <- addPerFeatureQC(sce)
sce <- addPerCellQC(sce, subsets = list(RPG = Rpg))

This adds a number of columns to the rowData and colData slots of the SingleCellExperiment object:

colnames(rowData(sce))
## [1] "gene.symbols" "gene.counts"  "mean"         "detected"
colnames(colData(sce))
##  [1] "study"                "cell.class"          
##  [3] "library_size"         "detected_genes"      
##  [5] "sum"                  "detected"            
##  [7] "subsets_RPG_sum"      "subsets_RPG_detected"
##  [9] "subsets_RPG_percent"  "total"

For example, we can plot the distribution of library sizes (sum) and the number of detected genes (detected) across the cells

hist(log10(sce$sum), breaks = 30)

hist(sce$detected, breaks = 30)


Exercise

Would you expect there to be an association between the total count and the number of detected genes in a cell? Test your prediction.

plot(sce$sum, sce$detected)


Finally, we can look at the set of genes accounting for the majority of the counts in the data set.

scater::plotHighestExprs(sce, n = 10)

Filtering

Now that we have calculated the QC metrics, we will use them to filter out low-quality cells that will be excluded from the rest of the analysis. The optimal parameters for filtering are debated and likely data set dependent, but a typical approach is to remove cells that fall ‘too far’ from the average cells on one or more of the considered criteria. This makes the implicit assumption that ‘most’ cells are of good quality, which is often sensible. It should also be noted that in some cases, cells that seem to be of bad quality can do so for biological reasons. For example, certain cell types express very few genes, or have a high metabolic rate and consequently express a lot of mitochondrial genes.

Here, we will exclude cells according to two criteria:

For each of these criteria, we exclude cells that are more than 2 and 1 median absolute deviations (MAD) correspondingly from the median across cells, in the direction indicating low quality.

low_detected <- isOutlier(sce$detected, type = "lower", 
                          log = TRUE, nmads = 2)
high_rpg <- isOutlier(sce$subsets_RPG_percent, type = "higher",
                     log = FALSE, nmads = 1)

plot(rank(-sce$detected), sce$detected, col = low_detected + 1)

plot(rank(sce$subsets_RPG_percent), sce$subsets_RPG_percent,
     col = high_rpg + 1)


Exercise

Filter out the cells identified as being of low quality according to the thresholds defined above. How many cells are left?

sce$retain <- !low_detected & !high_rpg
sce <- sce[, sce$retain]
dim(sce)
## [1] 5092 2721

Normalization

Just as for bulk RNA-seq, the raw scRNA-seq counts are not directly comparable across cells due to, e.g., differences in library sizes. Thus, we need to apply a normalization strategy. There are many approaches to normalization of scRNA-seq data (see e.g. Lytal et al 2020 and Cole et al 2019 for comparisons); here, we will use one implemented in the scran package. Similarly to the TMM and DESeq normalization approaches that we have discussed previously, this works by estimating a size factor for each cell, which incorporates the library size as well as a measure of the RNA composition. The bulk RNA-seq methods are sometimes struggling with scRNA-seq due to the large number of zeros; the scran approach solves this by repeatedly pooling multiple cells (which reduces the number of zeros), calculating size factors for the pools, and deriving individual size factors for the cells from those of the pools. After calculating the size factors, we normalize the observed counts and log-transform the values. The new “log counts” are placed in a new assay in sce, named logcounts.

library(scran)
sce <- computeSumFactors(sce, min.mean = 0.1)
sce <- logNormCounts(sce)
assayNames(sce)
## [1] "counts"    "logcounts"

Exercise

Plot the estimated size factors against the total count for each cell (hint: you can access the size factors using the sizeFactors() accessor function). Is there an association? Is this what you expected?

plot(sce$sum, sizeFactors(sce))


Mean-variance relationship

Variation in gene abundance estimates between different cells can be thought of as the convolution of the technical (mainly sampling) and the biological (e.g cell type) sources of variance. Typically one wants to isolate and focus on the biological variance so that differences due to experimental noise have as small an impact as possible on subsequent analyses.
There are different approaches to disentangling the technical and biological variability in a single-cell data set. Some of these assume that “most” genes exhibit only technical variability (i.e., most genes are not differentially expressed between different studied cell types). Other approaches assume that the technical variance follows a Poisson distribution (a common distribution capturing sampling variability in count data), and that deviations from the Poisson expectation corresponds to biological variability.

## Fit a trend
dec.trend <- modelGeneVar(sce)
fit.trend <- metadata(dec.trend)
plot(fit.trend$mean, fit.trend$var, xlab = "Mean of log-expression",
     ylab = "Variance of log-expression")
curve(fit.trend$trend(x), col = "dodgerblue", add = TRUE, lwd = 2)

assay(sce, "normcounts") = sweep(counts(sce), 2, sizeFactors(sce, "UMI"), FUN = "/")
assay(sce, "lognorm") = log2(normcounts(sce)+1)

use.data = normcounts(sce)
# Genes that are detected in at least one cell:
DetectedGenes = which(rowSums(use.data) > 0)
# calculate the genes' mean expresion and coefficient of variation(with pseudocounts):
mean_GE = rowMeans(use.data[DetectedGenes, ] + 1/ncol(use.data))
gene_cv = apply(use.data[DetectedGenes, ], 1, function(x) sd(x)/mean(x + 1/length(x)))
# Log transform:
X1 = log2(mean_GE)
Y1 = log2(gene_cv + 1/ncol(use.data))
# linear fit of log(cv) as a function of log(gene expression):
m = lm(Y1[DetectedGenes] ~ X1[DetectedGenes])
# scatterplot of log(cv) as a function of log(mean expression)
plot(X1[DetectedGenes], Y1[DetectedGenes], xlab = "log2(mean gene expression)", ylab = "log2(coefficent of variation)", 
    main = "mean-variance trend", pch = ".", cex = 2, col = "#000000")
# Add regression line
abline(coef(m)[1], coef(m)[2], col = "red", lwd = 2, lty = 2)
# Slope in m-v trend according to poisson distribution:
abline(0, -0.5, col = "grey", lwd = 2, lty = 2)

In each case, the biological variance of a gene can be estimated as the difference between the total variance and the modeled technical variance. This value can be used to rank the genes, and to select a set of “highly variable” genes to use as the basis of further downstream analysis. Here, we select the top 1,000 genes based on the trend fit, and add these to the metadata of the SingleCellExperiment object.

head(dec.trend[order(dec.trend$bio, decreasing = TRUE), ])
## DataFrame with 6 rows and 6 columns
##            mean     total      tech       bio     p.value         FDR
##       <numeric> <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Acta2   1.64071   6.28171  1.339039   4.94267 2.38243e-21 6.06566e-18
## Trf     1.85707   6.43099  1.497829   4.93316 2.23219e-17 3.78878e-14
## Apoe    1.88850   5.63135  1.521812   4.10954 2.84300e-12 2.06808e-09
## Gpx3    2.15808   5.62590  1.744122   3.88178 6.87905e-09 2.69447e-06
## Krt18   4.64286   4.40725  0.601032   3.80622 5.53954e-59 2.82074e-55
## Krt14   1.38563   4.70182  1.166845   3.53498 5.52146e-15 5.62306e-12
#head(dec.pois[order(dec.pois$bio, decreasing = TRUE), ])
hvg.var <- getTopHVGs(dec.trend, n = 1000)
head(hvg.var)
## [1] "Acta2" "Trf"   "Apoe"  "Gpx3"  "Krt18" "Krt14"
metadata(sce)$hvgs <- hvg.var
# Add overdispersed genes in mean variance trend plot:
plot(X1[DetectedGenes], Y1[DetectedGenes], xlab = "log2(mean gene expression)", ylab = "log2(coefficent of variation)", 
    main = "mean-variance trend", pch = ".", cex = 2, col = "#000000")
abline(coef(m)[1], coef(m)[2], col = "red", lwd = 2, lty = 2)
abline(0, -0.5, col = "grey", lwd = 2, lty = 2)
points(X1[hvg.var],  Y1[hvg.var], col="green",cex=0.3)
points(X1[Rpg],  Y1[Rpg], col="blue",cex=0.3)

Data visualization

For bulk RNA-seq data, we typically use PCA for visualization and exploratory analysis. This can of course also be applied to single-cell RNA-seq data. However, other methods (e.g., tSNE and UMAP) are more commonly used for scRNA-seq. Both tSNE and UMAP are non-linear dimension reduction methods, which focus to a larger extent on retaining small distances. That means that cells that are similar to each other (e.g., cells of the same cell type) tend to be placed close together in the low-dimensional representation, whereas larger distances are potentially less faithfully represented. PCA, on the other hand, tend to put more focus on preserving the large cell-to-cell distances. Importantly, even though PCA is rarely used for visualization of large scRNA-seq data sets in two-dimensional plots, it is often used as a first dimension reduction step, and other approaches such as tSNE and UMAP are subsequently applied to the PCA output.

Here, we therefore first apply PCA to our data set. We supply the set of highly variable genes derived above, to only calculate the principal components based on these. By default, the runPCA function from the scater package will apply the PCA to the ‘logcounts’ assay. We plot the first two principal components (note that we extract 50 components, which will be used as the basis for the tSNE later) and color by the number of detected genes.

sce <- runPCA(sce, exprs_values = "logcounts", ncomponents = 50, 
              subset_row = metadata(sce)$hvgs)
sce
## class: SingleCellExperiment 
## dim: 5092 2721 
## metadata(1): hvgs
## assays(4): counts logcounts normcounts lognorm
## rownames(5092): Malat1 Rpl41 ... Slc45a3 Acsbg1
## rowData names(4): gene.symbols gene.counts mean detected
## colnames(2721): vis2558 spk2385 ... wal2232 wal1284
## colData names(12): study cell.class ... retain sizeFactor
## reducedDimNames(1): PCA
## altExpNames(0):
reducedDimNames(sce)
## [1] "PCA"
plotReducedDim(sce, "PCA", colour_by = "detected")

Next, we apply tSNE, which is a non-linear, stochastic dimension reduction technique that attempts to find a mapping of the data on a low subspace while preserving local distances between cells. The non-linear character of tSNE means that it will often produce projections that better resolve differences between cell groups. The better separation of tSNE comes at the cost of interpretability; while in a tSNE representation similar cells are placed close to each other, longer distances in the representation are not guaranteed to reflect true relationships. This means that it is risky to draw conclusions of “similarity” or “dissimilarity” from the positional relationships of different cell groupings that appear in a tSNE plot In addition, the stochastic nature of tSNE means that every time the algorithm is applied a different representation will be produced unlesss a random seed is set.

set.seed(123)
sce <- runTSNE(sce, dimred = "PCA")
reducedDimNames(sce)
## [1] "PCA"  "TSNE"
plotReducedDim(sce, "TSNE", colour_by = "detected")

plotReducedDim(sce, "TSNE", colour_by = "study")

plotReducedDim(sce, "TSNE", colour_by = "cell.class")

What do you observe? Do the projections reflect mostly originating batch or biological heteregoneity?

We can see the impact of the batch in the TSNE projections generated without taking any steps for integrating the datasets.

We can often identify specific cell types and gain some understanding of the data directly from the visualization (other types of cell type identification, e.g. via cell type clustering, will be discussed in the next lecture). Typically, this would be done by colouring the cells according to the expression level of known marker genes for certain cell types.

## Color by expression of a marker (Csn3)
plotReducedDim(sce, "TSNE", colour_by = "Csn3")

## Color by expression of a marker (Trf)
plotReducedDim(sce, "TSNE", colour_by = "Trf")

Batch effect correction

Batch effect refers to techinal sources of variation introduced to the data during handling/preparation/processing when these are prepared in different batches. This variance distorts the measurements we obtain resulting in the artefactual separation of similar cells across batches:

In the case of single cell sequencing batch effects issues are especially severe because:

Our goals when applying batch correction methods are the following:

Batch effect correction using fast-MNN

Now we will use the matching mutual nearest neighbors technique first proposed by (Haghverdi et al. 2018). This method attempts to identify similar cells across batches after projecting them in a space of reduced dimensions.

Because the correction does not apply a uniform shift across the whole batch, but rather is adjusted to specific cells, it is both a more efficient eraser of batch-orignating variance and less prone to overcorrection. However it also operates on a set of assumptions, mainly that:

  • There is at least a cell population that is common across all pairs of batches.
  • The batch effect variation is almost orthogonal to the biological variation (no high level of confounding) or at least its effects are smaller than the biological variation effect between different cell types.

The function that implements this is fastMNN() from the batchelor package.

d <- 32
FMNN.out <-  batchelor::fastMNN( assays(sce)[["lognorm"]][hvg.var,],  batch=sce$study , d=d ) 
# Notice that the object returned is a single cell experiment.
reducedDim (sce, "PCA.FMNN" ) <- reducedDim(FMNN.out, "corrected") 
reducedDim( sce, "TSNE.FMNN" ) <- Rtsne( reducedDim(sce, "PCA.FMNN"), perplexity = 30, initial_dims=64, pca=FALSE,num_threads=32,theta=0.25)$Y
cowplot::plot_grid(scater::plotReducedDim(sce, colour_by = "study", dimred="TSNE.FMNN" ),
                   scater::plotReducedDim(sce, colour_by = "cell.class", dimred="TSNE.FMNN")
                   )

Clustering

Clustering refers to the task of identifying cell subpopulations with similar expression profiles. The underlying assumption is that these clusters should reflect distinct biological subtypes and is therefore a useful way to summarize the biological heterogeneity of our dataset. It is important to keep in mind that defining the desired resolution of the clustering procedure (i.e the ideal number of clusters) is often subjective and context-dependent. In downstream analysis one can extract more biological insights for the identified groupings by identify “marker genes” that chraracterize them.

Graph based clustering

Graph-based approaches model cell populations as nodes on a network and try to identify well connected node communities (cell groupings). Graph (network) construction is based on drawing edges (connections) between nodes (cells) according to some predefined measure of node to node similarity. A wide range of community-detection algorithms is then available to partition the resulting graph in communities of cells that are better connected to each other then to the rest of the cells.

Graph-based approaches have become extremely popular in the context of scRNA sequencing analysis because they tend to be robust to unwanted variance that produces continuous distortions to the signal (such as sampling variance), they make no assumptions on the shape or the distribution of cells within clusters and they are scalable.

We will now build a Shared Nearest Neighbor Graph (SNN graph) using scran. In such a graph two cells are connected only if at least one of their k nearest neighbors is common.

We will then identify clusters using a community detection method from the package igraph.

# Create a graph of shared nearest neighbors: Two cells are connected iff 
g <- scran::buildSNNGraph(sce, k=25, use.dimred = 'PCA') # Build SNN graph
clust <- igraph::cluster_louvain(g)$membership # use the louvain method for community detection
table(clust)
## clust
##   1   2   3   4   5   6   7   8   9 
## 379 210 201 145 369 496 350 336 235
sce$clust <- factor(clust) 
cowplot::plot_grid(scater::plotTSNE(sce, colour_by = "study"),
                   scater::plotTSNE(sce, colour_by = "cell.class"),
                   scater::plotTSNE(sce, colour_by = "clust")
                   )

Identification of marker genes

As mentioned above, after clustering one can further characterize the derived cell subopulations by identifying genes that are differentially expressed in each cluster. In the case that those genes are over-expressed in a cluster, they are also referred to as marker genes. Although this is the typical type of differential expression analysis in SC RNASeq datasets it is worth mentioning that the character of the data allows us to compare gene expression in cell subpopulations from other points of view. For example one can study in more detail, apart from the mean gene expression,additional characteristics of the gene expression distribution distinct subpopulations, such as its variance, skewness or multimodality:

It is worth noting that these are parameters that are not accessible for study in typical bulk RNAseq datasets.

A related issue is how we come up with the definition of distinct subpopulations in our experiment. We saw above how we can leverage information coming from the data themselves to define subpopulations using clustering approaches. This comes at the expense of being somewhat circular in our analyses (we define clusters using the data and then go back to the same data to perform DE analyses!).

However, the groupings on which we perform DE analysis need not be the results of clustering. They could also come from assignment of cell types using prior information (e.g in cases where the cells have been sorted). Actually, in the examples that will follow we will perform DE analysis on groups defined by the known biological labels of our dataset.

Although SC-specific methods exist for differential gene expression analysis, in practice methods developed for mulk RNAseq or simple statistical tests such as the t-test or its non-parametric counterpart the Wilcoxon test perform as well and often better (see Soneson et al. 2018)!

Before we move on with the analysis of marker genes we should note that there are additional facets of “differential” analyses that can be performed when one is faced with multi-sample SC datasets. In such multi-sample datasets, similar cell populations are derived from somewhat different contexts (e.g different individuals, treatments, gene perturbations). In that case one might want to compare cell type composition between different conditions (“differential abundance analysis”), or to perform differential expression analysis of the same cell type across the different conditions (“differential state analysis”).

For example in the case of our example dataset one could ask what are the differences in the proportion of luminal mature cells in the three samples, or what are the differences in gene expression in the luminal mature cell subtype across the three samples. These questions, although they can be very interesting, are beyond the scope of this course and we refer the students to the relevant section of the osca bioconductor resource: “Multi-sample comparisons”.

Here we will use the scran package in order to idetify marker genes and visualize their expression.

sce.vis <- sce[,sce$study=="vis"] # Extract the cells from one specific study

markers_all <- scran::findMarkers(
  sce.vis, groups = sce.vis$cell.class, 
  test.type = "wilcox", assay.type="lognorm",
  pval.type = "all", direction = "up"
)

The findMarkers function expects as input normalized log-expression values.

The choice of pval.type determines whether the highly ranked genes are those that are DE between the current group and:

Let’s now visualize some of the identified markers:

head(markers_all[[1]])
## DataFrame with 6 rows and 5 columns
##           p.value         FDR summary.AUC AUC.basal
##         <numeric>   <numeric>   <numeric> <numeric>
## Csn3  2.82649e-96 1.43925e-92    0.966263  0.966688
## Trf   5.24109e-92 1.33438e-88    0.998365  0.998365
## Mfge8 4.33794e-86 7.36293e-83    0.978824  0.978824
## Cd14  9.80824e-84 1.24859e-80    0.932963  0.933145
## Plet1 2.26089e-81 2.30249e-78    0.978851  0.989649
## Itm2b 1.11978e-80 9.50321e-78    0.970675  0.970675
##       AUC.luminal_mature
##                <numeric>
## Csn3            0.966263
## Trf             0.999887
## Mfge8           0.995051
## Cd14            0.932963
## Plet1           0.978851
## Itm2b           0.985539
# Violi plots of expression for a set of features on defined groups
scater::plotExpression(sce.vis, features = c("Csn3", "Trf"), 
                       x = "cell.class")

#Colour a projection according to an identified marker gene:
cowplot::plot_grid(scater::plotTSNE(sce.vis, colour_by = "Csn3"),
                   scater::plotTSNE(sce.vis, colour_by = "cell.class"))

#Heatmap for the top two marker genes per cell group:
scater::plotHeatmap(sce.vis, features = unique(unlist(lapply(markers_all, function(w) rownames(w)[1:2]))),
                    columns = colnames(sce.vis)[order(sce.vis$cell.class)],
                    colour_columns_by = "cell.class", cluster_cols = FALSE,
                    show_colnames = FALSE, cluster_rows = FALSE)

Other frameworks

In this lesson we have used Bioconductor packages for the analysis, and represented the data in a SingleCellExperiment container. Just as for bulk RNA-seq, there are also other containers that are often used for single-cell data. The most common alternative is the Seurat object, which is the basis for the workflow provided by the Seurat package. This package provides to a large extent similar capabilities as the Bioconductor packages we have seen in this lecture, and can be used as an alternative. The webpage contains a collection of tutorials (including one for the same data set that we studied here).

Additional resources

References

  1. Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75

  2. Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421

  3. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.

  4. Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD, Zhang Y, Torres LC (2019). sva: Surrogate Variable Analysis. R package version 3.32.1

  5. Soneson C, Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat Methods, 15(4):255-261

Session info

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices
##  [7] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] igraph_1.2.6                dplyr_1.0.7                
##  [3] tidyr_1.1.3                 rsvd_1.0.5                 
##  [5] Rtsne_0.15                  BiocNeighbors_1.8.2        
##  [7] BiocSingular_1.6.0          batchelor_1.6.3            
##  [9] scran_1.18.7                scater_1.18.6              
## [11] ggplot2_3.3.4               SingleCellExperiment_1.12.0
## [13] SummarizedExperiment_1.20.0 Biobase_2.50.0             
## [15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7        
## [17] IRanges_2.24.1              S4Vectors_0.28.1           
## [19] BiocGenerics_0.36.1         MatrixGenerics_1.2.1       
## [21] matrixStats_0.59.0          BiocStyle_2.18.1           
## [23] png_0.1-7                   knitr_1.33                 
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7              RColorBrewer_1.1-2       
##  [3] tools_4.0.5               bslib_0.2.5.1            
##  [5] utf8_1.2.1                R6_2.5.0                 
##  [7] irlba_2.3.3               ResidualMatrix_1.0.0     
##  [9] vipor_0.4.5               DBI_1.1.1                
## [11] colorspace_2.0-1          withr_2.4.2              
## [13] tidyselect_1.1.1          gridExtra_2.3            
## [15] compiler_4.0.5            DelayedArray_0.16.3      
## [17] labeling_0.4.2            sass_0.4.0               
## [19] scales_1.1.1              stringr_1.4.0            
## [21] digest_0.6.27             rmarkdown_2.9            
## [23] XVector_0.30.0            pkgconfig_2.0.3          
## [25] htmltools_0.5.1.1         sparseMatrixStats_1.2.1  
## [27] limma_3.46.0              highr_0.9                
## [29] rlang_0.4.11              DelayedMatrixStats_1.12.3
## [31] farver_2.1.0              jquerylib_0.1.4          
## [33] generics_0.1.0            jsonlite_1.7.2           
## [35] BiocParallel_1.24.1       RCurl_1.98-1.3           
## [37] magrittr_2.0.1            GenomeInfoDbData_1.2.4   
## [39] scuttle_1.0.4             Matrix_1.3-4             
## [41] Rcpp_1.0.6                ggbeeswarm_0.6.0         
## [43] munsell_0.5.0             fansi_0.5.0              
## [45] viridis_0.6.1             lifecycle_1.0.0          
## [47] stringi_1.6.2             yaml_2.2.1               
## [49] edgeR_3.32.1              zlibbioc_1.36.0          
## [51] dqrng_0.3.0               crayon_1.4.1             
## [53] lattice_0.20-44           cowplot_1.1.1            
## [55] beachmat_2.6.4            locfit_1.5-9.4           
## [57] pillar_1.6.1              glue_1.4.2               
## [59] evaluate_0.14             BiocManager_1.30.16      
## [61] vctrs_0.3.8               gtable_0.3.0             
## [63] purrr_0.3.4               assertthat_0.2.1         
## [65] xfun_0.24                 viridisLite_0.4.0        
## [67] pheatmap_1.0.12           tibble_3.1.2             
## [69] beeswarm_0.4.0            bluster_1.0.0            
## [71] statmod_1.4.36            ellipsis_0.3.2